
    Distributed Data Summarization in Well-Connected Networks

    We study distributed algorithms for some fundamental problems in data summarization. Given a communication graph $G$ of $n$ nodes, each of which may hold a value initially, we focus on computing $\sum_{i=1}^N g(f_i)$, where $f_i$ is the number of occurrences of value $i$ and $g$ is some fixed function. This includes important statistics such as the number of distinct elements, frequency moments, and the empirical entropy of the data. In the CONGEST model, a simple adaptation from streaming lower bounds shows that computing some of these statistics exactly requires $\tilde{\Omega}(D + \sqrt{n})$ rounds, where $D$ is the diameter of the graph. However, these lower bounds do not hold for graphs that are well-connected. We give an algorithm that computes $\sum_{i=1}^{N} g(f_i)$ exactly in $\tau_G \cdot 2^{O(\sqrt{\log n})}$ rounds, where $\tau_G$ is the mixing time of $G$. This also has applications in computing the top $k$ most frequent elements. We demonstrate that there is a high similarity between the GOSSIP model and the CONGEST model in well-connected graphs. In particular, we show that each round of the GOSSIP model can be simulated almost perfectly in $\tilde{O}(\tau_G)$ rounds of the CONGEST model. To this end, we develop a new algorithm for the GOSSIP model that $(1 \pm \epsilon)$-approximates the $p$-th frequency moment $F_p = \sum_{i=1}^N f_i^p$ in $\tilde{O}(\epsilon^{-2} n^{1-k/p})$ rounds for $p \geq 2$, when the number of distinct elements $F_0$ is at most $O(n^{1/(k-1)})$. This result can be translated back to the CONGEST model with a factor $\tilde{O}(\tau_G)$ blow-up in the number of rounds.
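    To make the target quantity concrete: with $g(f) = 1$ the sum counts distinct elements ($F_0$), with $g(f) = f^p$ it is the $p$-th frequency moment, and with $g(f) = (f/m)\log(m/f)$ (where $m$ is the stream length) it is the empirical entropy. The sketch below (the helper name `summarize` and the toy data are illustrative) computes these statistics centrally from raw frequencies; it only illustrates the definitions, not the distributed CONGEST/GOSSIP algorithms of the paper.

        from collections import Counter
        from math import log

        def summarize(values, g):
            # sum_i g(f_i), where f_i is the number of occurrences of value i
            freq = Counter(values)
            return sum(g(f) for f in freq.values())

        data = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5]
        m = len(data)

        distinct = summarize(data, lambda f: 1)                     # F_0
        f2       = summarize(data, lambda f: f * f)                 # F_2
        entropy  = summarize(data, lambda f: (f / m) * log(m / f))  # empirical entropy

        print(distinct, f2, entropy)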

    Finding Subcube Heavy Hitters in Analytics Data Streams

    Data streams typically have items with a large number of dimensions. We study the fundamental heavy-hitters problem in this setting. Formally, the data stream consists of $d$-dimensional items $x_1,\ldots,x_m \in [n]^d$. A $k$-dimensional subcube $T$ is a subset of distinct coordinates $\{ T_1,\cdots,T_k \} \subseteq [d]$. A subcube heavy hitter query ${\rm Query}(T,v)$, $v \in [n]^k$, outputs YES if $f_T(v) \geq \gamma$ and NO if $f_T(v) < \gamma/4$, where $f_T(v)$ is the fraction of stream items whose coordinates $T$ have joint values $v$. The all subcube heavy hitters query ${\rm AllQuery}(T)$ outputs all joint values $v$ that return YES to ${\rm Query}(T,v)$. The one-dimensional version of this problem, where $d=1$, was heavily studied in data stream theory, databases, networking and signal processing. The subcube heavy hitters problem is applicable in all these cases. We present a simple reservoir-sampling-based one-pass streaming algorithm that solves the subcube heavy hitters problem in $\tilde{O}(kd/\gamma)$ space. This is optimal up to poly-logarithmic factors given the established lower bound. In the worst case, this is $\Theta(d^2/\gamma)$, which is prohibitive for large $d$, and our goal is to circumvent this quadratic bottleneck. Our main contribution is a model-based approach to the subcube heavy hitters problem. In particular, we assume that the dimensions are related to each other via the Naive Bayes model, with or without a latent dimension. Under this assumption, we present a new two-pass, $\tilde{O}(d/\gamma)$-space algorithm for our problem, and a fast algorithm for answering ${\rm AllQuery}(T)$ in $O(k/\gamma^2)$ time. Our work develops the direction of model-based data stream analysis, with much that remains to be explored. Comment: To appear in WWW 201
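    The abstract describes its first algorithm only as "reservoir sampling based"; the sketch below is one plausible minimal instantiation of that idea, not the paper's exact algorithm: keep a uniform reservoir of roughly $c/\gamma$ items and answer ${\rm Query}(T,v)$ by thresholding the empirical fraction of sampled items whose coordinates $T$ equal $v$. The class name, the sample-size constant c, and the gamma/2 decision threshold are assumptions made for illustration.

        import random

        class SubcubeHHSketch:
            # One-pass uniform reservoir sample over d-dimensional stream items (tuples).
            def __init__(self, gamma, c=50, seed=0):
                self.gamma = gamma
                self.size = max(1, int(c / gamma))   # reservoir size, an assumed parameter
                self.reservoir = []
                self.count = 0
                self.rng = random.Random(seed)

            def update(self, item):
                # Standard reservoir sampling (Algorithm R).
                self.count += 1
                if len(self.reservoir) < self.size:
                    self.reservoir.append(item)
                else:
                    j = self.rng.randrange(self.count)
                    if j < self.size:
                        self.reservoir[j] = item

            def query(self, T, v):
                # Estimate f_T(v) as the fraction of sampled items matching v on coordinates T.
                if not self.reservoir:
                    return False
                hits = sum(1 for x in self.reservoir
                           if all(x[t] == vj for t, vj in zip(T, v)))
                return hits / len(self.reservoir) >= self.gamma / 2

        # Example: items are 3-dimensional; ask whether coordinates (0, 2) take values (1, 7).
        sketch = SubcubeHHSketch(gamma=0.1)
        for item in [(1, 5, 7), (1, 6, 7), (2, 5, 8), (1, 4, 7)]:
            sketch.update(item)
        print(sketch.query((0, 2), (1, 7)))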

    Better Streaming Algorithms for the Maximum Coverage Problem

    We study the classic NP-hard problem of finding the maximum k-set coverage in the data stream model: given a set system of m sets that are subsets of a universe {1,...,n}, find the k sets that cover the largest number of distinct elements. The problem can be approximated up to a factor of 1-1/e in polynomial time. In the streaming-set model, the sets and their elements are revealed online. The main goal of our work is to design algorithms, with approximation guarantees as close as possible to 1-1/e, that use sublinear space o(mn). Our main results are: 1) Two (1-1/e-epsilon) approximation algorithms: one uses O(1/epsilon) passes and O(k/epsilon^2 polylog(m,n)) space, whereas the other uses only a single pass but O(m/epsilon^2 polylog(m,n)) space. 2) We show that any approximation factor better than (1-(1-1/k)^k) in a constant number of passes requires space that is linear in m for constant k, even if the algorithm is allowed unbounded processing time. We also demonstrate a single-pass, (1-epsilon) approximation algorithm using O(m/epsilon^2 min(k,1/epsilon) polylog(m,n)) space. We also study the maximum k-vertex coverage problem in the dynamic graph stream model. In this model, the stream consists of edge insertions and deletions of a graph on N vertices. The goal is to find k vertices that cover the largest number of distinct edges. We show that any constant-factor approximation in a constant number of passes requires space that is linear in N for constant k, whereas O(N/epsilon^2 polylog(m,n)) space is sufficient for a (1-epsilon) approximation and arbitrary k in a single pass. For regular graphs, we show that O(k/epsilon^3 polylog(m,n)) space is sufficient for a (1-epsilon) approximation in a single pass. We generalize this to a (K-epsilon) approximation when the ratio between the minimum and maximum degree is bounded below by K.
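    The 1-1/e factor cited above as the polynomial-time benchmark is achieved by the classical offline greedy algorithm, sketched below for reference; the paper's contribution is to approach this guarantee in sublinear space, which this offline sketch does not attempt. The function name and example sets are illustrative.

        def greedy_max_coverage(sets, k):
            # Offline greedy: repeatedly pick the set covering the most uncovered elements.
            # Achieves a (1 - 1/e) approximation to maximum k-set coverage.
            covered, chosen = set(), []
            for _ in range(k):
                best, best_gain = None, 0
                for i, s in enumerate(sets):
                    gain = len(s - covered)
                    if gain > best_gain:
                        best, best_gain = i, gain
                if best is None:      # nothing new can be covered
                    break
                chosen.append(best)
                covered |= sets[best]
            return chosen, covered

        sets = [{1, 2, 3}, {3, 4}, {4, 5, 6, 7}, {1, 7}]
        print(greedy_max_coverage(sets, k=2))   # picks index 2, then index 0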

    Distributed Dense Subgraph Detection and Low Outdegree Orientation


    Maximum Coverage in the Data Stream Model: Parameterized and Generalized

    We present algorithms for the Max-Cover and Max-Unique-Cover problems in the data stream model. The input to both problems consists of $m$ subsets of a universe of size $n$ and a value $k \in [m]$. In Max-Cover, the problem is to find a collection of at most $k$ sets such that the number of elements covered by at least one set is maximized. In Max-Unique-Cover, the problem is to find a collection of at most $k$ sets such that the number of elements covered by exactly one set is maximized. Our goal is to design single-pass algorithms that use space that is sublinear in the input size. Our main algorithmic results are: If the sets have size at most $d$, there exist single-pass algorithms using $\tilde{O}(d^{d+1} k^d)$ space that solve both problems exactly. This is optimal up to polylogarithmic factors for constant $d$. If each element appears in at most $r$ sets, we present single-pass algorithms using $\tilde{O}(k^2 r/\epsilon^3)$ space that return a $1+\epsilon$ approximation in the case of Max-Cover. We also present a single-pass algorithm using slightly more memory, i.e., $\tilde{O}(k^3 r/\epsilon^{4})$ space, that $1+\epsilon$ approximates Max-Unique-Cover. In contrast to the above results, when $d$ and $r$ are arbitrary, any constant-pass $1+\epsilon$ approximation algorithm for either problem requires $\Omega(\epsilon^{-2}m)$ space, but a single-pass $O(\epsilon^{-2}mk)$ space algorithm exists. In fact, any constant-pass algorithm with an approximation better than $e/(e-1)$ and $e^{1-1/k}$ for Max-Cover and Max-Unique-Cover respectively requires $\Omega(m/k^2)$ space when $d$ and $r$ are unrestricted. En route, we also obtain an algorithm for a parameterized version of the streaming Set-Cover problem. Comment: Conference version to appear at ICDT 202
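    The difference between the two objectives is easy to miss; the brute-force sketch below (exponential in $k$, purely an illustration of the definitions rather than the paper's streaming algorithms) evaluates both objectives on the same collection of sets. The function name and example sets are illustrative.

        from itertools import combinations
        from collections import Counter

        def best_collections(sets, k):
            # Exhaustively try every choice of k sets.
            # Max-Cover: elements covered by at least one chosen set.
            # Max-Unique-Cover: elements covered by exactly one chosen set.
            best_cover, best_unique = (-1, None), (-1, None)
            for combo in combinations(range(len(sets)), k):
                counts = Counter(e for i in combo for e in sets[i])
                cover = len(counts)
                unique = sum(1 for c in counts.values() if c == 1)
                if cover > best_cover[0]:
                    best_cover = (cover, combo)
                if unique > best_unique[0]:
                    best_unique = (unique, combo)
            return best_cover, best_unique

        sets = [{1, 2, 3, 4, 5, 6}, {5, 6, 7, 8, 9, 10}, {1, 2, 3}]
        print(best_collections(sets, k=2))
        # Max-Cover picks sets 0 and 1 (10 elements covered, only 8 uniquely);
        # Max-Unique-Cover picks sets 1 and 2 (9 elements, each covered exactly once).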

    On the Locality of Nash-Williams Forest Decomposition and Star-Forest Decomposition

    Given a graph $G=(V,E)$ with arboricity $\alpha$, we study the problem of decomposing the edges of $G$ into $(1+\epsilon)\alpha$ disjoint forests in the distributed LOCAL model. Barenboim and Elkin [PODC '08] gave a LOCAL algorithm that computes a $(2+\epsilon)\alpha$-forest decomposition using $O(\frac{\log n}{\epsilon})$ rounds. Ghaffari and Su [SODA '17] made further progress by computing a $(1+\epsilon)\alpha$-forest decomposition in $O(\frac{\log^3 n}{\epsilon^4})$ rounds when $\epsilon \alpha = \Omega(\sqrt{\alpha \log n})$, i.e. the limit of their algorithm is an $(\alpha + \Omega(\sqrt{\alpha \log n}))$-forest decomposition. This algorithm, based on a combinatorial construction of Alon, McDiarmid \& Reed [Combinatorica '92], in fact provides a decomposition of the graph into \emph{star-forests}, i.e. each forest is a collection of stars. Our main result in this paper is to reduce the threshold of $\epsilon \alpha$ in $(1+\epsilon)\alpha$-forest decomposition and star-forest decomposition. This further answers the $10^{\text{th}}$ open question from Barenboim and Elkin's "Distributed Graph Algorithms" book. Moreover, it gives the first $(1+\epsilon)\alpha$-orientation algorithms with \emph{linear dependencies} on $\epsilon^{-1}$. At a high level, our results for forest decomposition are based on a combination of network decomposition, load balancing, and a new structural result on local augmenting sequences. Our result for star-forest decomposition uses a more careful probabilistic analysis for the construction of Alon, McDiarmid \& Reed; the bounds on star-arboricity here were not previously known, even non-constructively.
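    For intuition on how low-outdegree orientations yield forest decompositions (the connection behind the Barenboim-Elkin style bounds discussed above), here is a sequential, non-distributed sketch: peel vertices in order of minimum remaining degree, orient every edge from the earlier-peeled endpoint to the later one, and give each vertex's out-edges distinct labels. The orientation is acyclic, so each label class is a forest; the number of forests equals the maximum out-degree, which is at most $2\alpha - 1$ and thus weaker than the $(1+\epsilon)\alpha$ guarantees the paper targets. The function name and example graph are illustrative.

        import heapq
        from collections import defaultdict

        def forest_decomposition(adj):
            # adj maps each vertex to the set of its neighbours (undirected graph).
            # 1) Compute a degeneracy (min-degree peeling) order.
            remaining = {v: set(ns) for v, ns in adj.items()}
            heap = [(len(ns), v) for v, ns in remaining.items()]
            heapq.heapify(heap)
            order = {}                       # vertex -> position in the peeling order
            while heap:
                deg, v = heapq.heappop(heap)
                if v in order or deg != len(remaining[v]):
                    continue                 # stale heap entry
                order[v] = len(order)
                for u in remaining[v]:
                    remaining[u].discard(v)
                    heapq.heappush(heap, (len(remaining[u]), u))
            # 2) Orient each edge toward the later-peeled endpoint and label each
            #    vertex's out-edges 0, 1, 2, ...; every label class is a forest.
            forests = defaultdict(list)
            for v, ns in adj.items():
                label = 0
                for u in sorted(ns, key=order.get):
                    if order[v] < order[u]:
                        forests[label].append((v, u))
                        label += 1
            return dict(forests)

        adj = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {1, 2}}
        print(forest_decomposition(adj))
        # {0: [(0, 1), (1, 2), (2, 3)], 1: [(0, 2), (1, 3)]} -- two forests.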